Supplementary information: Lacking alignments? The next-generation sequencing mapper segemehl revisited

Authors

  • Christian Otto
  • Peter F. Stadler
  • Steve Hoffmann
Abstract

Each artificial dataset consists of 100 000 single- or paired-end reads and was simulated using Mason v0.1.1 [1] from the human genome (hg19), excluding haplotypes, random contigs, and 'non-chromosomal' sequences. For the single-end Illumina datasets, Mason was run in Illumina mode with the parameters -hn 2 and -sq, and the read length (-n) was set to 100 and 30 for long and short reads, respectively. For the paired-end dataset, the parameters -mp, -ll 375, and -le 100 were additionally specified, and the read length (-n) was again set to 100. The artificial 454 dataset was simulated in 454 mode with the parameters -hn 2, -sq, -k 0.3, -bm 0.4, -bs 0.2, and -nm 400 (analogously to Langmead & Salzberg [2]).

The real datasets were downloaded, converted to FASTQ format, post-processed where necessary, and downsampled to 100 000 single- or paired-end reads. The Illumina DNA-seq dataset was used as both single-end data (using only the first read of each pair) and paired-end data. For both Illumina mRNA-seq datasets, post-processing consisted of removing reads that possibly overlapped exon-exon junctions. To this end, the entire dataset was mapped using segemehl (with the -S option), STAR [3], TopHat2 [4], and SOAPsplice [5], and reads that were split-mapped by any of these tools were removed prior to downsampling. For the paired-end mRNA-seq dataset, a read pair was kept only if neither end was split-mapped by any of the tools. For Illumina shortRNA-seq, 3'-adapter contamination was clipped from the read sequences using fastx_clipper (part of the FASTX-Toolkit) with the Illumina shortRNA-seq adapter (TCGTATGCCGTCTTCTGCTTGT). Before downsampling, reads outside the expected length range (19-25 nt) were discarded. An overview of the benchmarking datasets, their sequencing platforms, library types, and average read lengths is given in Supplementary Table S1.
To permit full reproducibility, we have assembled an Electronic Supplement comprising all data, custom scripts, and detailed information on how to re-run the benchmarks.
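
The post-processing of the real datasets described above (discarding reads outside a length range, removing reads that any tool split-mapped, then downsampling to a fixed number of reads) can be illustrated with a minimal Python sketch. This is not the authors' script, which is part of the Electronic Supplement. All function names here are hypothetical; the sketch assumes well-formed four-line FASTQ records and single-end reads, and it uses reservoir sampling as one possible way to downsample, which the source does not specify.

```python
import random
from typing import Iterable, Iterator, List, Set, Tuple

# (read id, sequence, quality string); a hypothetical record layout
Record = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from FASTQ text, assuming exactly four lines per read."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        # read id = header without the leading '@', up to the first whitespace
        yield header[1:].split()[0], seq, qual


def union_of_split_mapped(per_tool_ids: Iterable[Set[str]]) -> Set[str]:
    """Collect ids of reads that were split-mapped by at least one tool."""
    return set().union(*per_tool_ids)


def keep_read(rec: Record, min_len: int, max_len: int, excluded: Set[str]) -> bool:
    """Keep a read if its length is within [min_len, max_len] and it is not excluded."""
    read_id, seq, _ = rec
    return min_len <= len(seq) <= max_len and read_id not in excluded


def downsample(records: Iterable[Record], n: int, seed: int = 0) -> List[Record]:
    """Draw a uniform sample of at most n records in one pass (reservoir sampling)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir: List[Record] = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = rec
    return reservoir
```

For the shortRNA-seq case one would filter with `keep_read(rec, 19, 25, set())` and then call `downsample(filtered, 100_000)`. For the mRNA-seq case the excluded set would be the union of the read ids that each mapper reported as split-mapped. Paired-end data would need both mates kept or dropped together, which this sketch does not handle.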

Similar resources

Lacking alignments? The next-generation sequencing mapper segemehl revisited

MOTIVATION Next-generation sequencing has become an important tool in molecular biology. Various protocols to investigate genomic, transcriptomic and epigenomic features across virtually all species and tissues have been devised. For most of these experiments, one of the first crucial steps of bioinformatic analysis is the mapping of reads to reference genomes. RESULTS Here, we present thorou...

MMR: a tool for read multi-mapper resolution

MOTIVATION Mapping high-throughput sequencing data to a reference genome is an essential step for most analysis pipelines aiming at the computational analysis of genome and transcriptome sequencing data. Breaking ties between equally well mapping locations poses a severe problem not only during the alignment phase but also has significant impact on the results of downstream analyses. We present...

Specificity control for read alignments using an artificial reference genome-guided false discovery rate

MOTIVATION Accurate estimation, comparison and evaluation of read mapping error rates is a crucial step in the processing of next-generation sequencing data, as further analysis steps and interpretation assume the correctness of the mapping results. Current approaches are either focused on sensitivity estimation and thereby disregard specificity or are based on read simulations. Although contin...

The mapping task and its various applications in next-generation sequencing

The aim of this thesis is the development and benchmarking of computational methods for the analysis of high-throughput data from tiling arrays and next-generation sequencing. Tiling arrays have been a mainstay of genome-wide transcriptomics, e.g., in the identification of functional elements in the human genome. Due to limitations of existing methods for the data analysis of this data, a novel...

Toolbox for Mobile-Element Insertion Detection on Cancer Genomes

Mobile elements constitute greater than 45% of the human genome as a result of repeated insertion events during human genome evolution. Although most of mobile elements are fixed within the human population, some elements (including ALU, long interspersed elements (LINE) 1 (L1), and SVA) are still actively duplicating and may result in life-threatening human diseases such as cancer, motivating ...

Publication date: 2014